(54) A method and a device for recognising speech

(57) In a speech recognition method and apparatus according to the present invention, feature vectors produced by an analysing unit of a speech recognition device are modified to compensate for the effects of noise. According to the invention, feature vectors are normalised using a sliding normalisation buffer (31). By means of the method according to the invention, the performance of the speech recognition device improves in situations wherein the speech recognition device's training phase has been carried out in a noise environment that differs from the noise environment of the actual speech recognition phase.
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present inventfon. for improving speech r^^nS ^^P""' «'=«'^«^'"9 to the 

ovM^tLTHM^sS^^^^^^^^^ 

recognition phase, oLrvations and ^'Ss^,^'"^^^^^^ statrstic models of recogn^ble words. At the 
and. based on probabilities, a model, sfo^iTn '^'^"'^'^ ^ P'«^o"nced word 

tothe pronounced word, isdetemiined For exiifZo^^^^^ 

Modeb. has been described in the ret^nce T Rabfn based on the Hidden Marko" 

tions^in spe^h recognition: Proceedlngs":nhe^Sr^^ 
anol^e—— P-^~ 

conditions during the operation of the ZlT^^^'^-^XZ^T 'T' '""^^^^ ^'^^'^"^'^^ 
of the speech recognition device. This is. inS on^ofmrrrotS^^^^ 

systems in practice, because It is imp<«sib etSte?! !!^,^ '° ^P^'' 

a speech recognition device can be ^ZTn^^lCi^o^^T^f^^ °' ~'"«nts. wherein 
is thatthe speech recognitior, device's trarniSa^elo^Tvr^^^^^ ' """''"^ " 'P^^"^' '^-vice 
the speech recognition device's operaliSiSS-r^ 
.n.ndingtra«icand.thevehlcleLMLs.c^^^^^^ 

.deS:s^'::~:ss:^^ 

phase of the speech recognition device t^aZ^ ^>^u^T' T microphone Is used at the training 

recognition device decreases sUbSJ P'^'"' the perfom,ance of the speech 

vectornrefrtJeTptrreX^^^^^^ ♦J-"- of no.e In the ca.u.ation of feature 

station applications, wherein speech is^ognS^^^^^^^ "'"^ '"'^^ computerAvork 

to be recognised is stored in a memory o^a Ste" TypSv ZHnl I!, T""' °' "^'^""^ ^^^^-^^ 
seconds. After this, the feature vectoLre mSSSf utSf j^ 
of the entire flle..Due to the length of the spee^sSa o 

time speech recognition." ^ ^ "^"^"^ ^*°^^'^' '»^^se kinds of methods are not applfcable to real- 

isatlon coefficients are updated with^e^ S^-Z the nTnSS^T^^" '° '"'"'•''^ "'^''"9. the nom«.. 
practice. In addition, this method requires a VAD the^erat^Srii^^'^ ' °^ "^"^ «"°"9'' 
applications with bwslgnal to noisrratfo (SNR) vaiuef SSi- 11'°° '"^"'"'^'^ '"^^P^^^'^ ^^°9""*°" 
tosaiddelay ■ . '^'v ' " *®®**''^'"^*'°^'"«et.therea^^^ 

and,'b;rearorwSsr^^^^^^^^^^^ 

noise. The modification of the fLt^^ot^S 'ST T '° '^^""^"^'^ ««««^« 

feature, vectors and by nom^lislng the feaL veTto ^sino Itl Z ^ T" '"^"^ '^^^•'^"^ '^^^ 

of , the present Invention, the featL ^^JZ arTZSJSliS^J^^^^^ "^""""^ '° " embodiment 
invention, the updating of the normalisatto^ Lram«?r,^fh f ? nomialisatton buffer By means of the 

the delay in the actua..orma.lsati^ SrssT;S^ 

to be implemented. ""^^^^ '° a feal-time speech recognltkxi applcatlon 

aspirL^^r™^ 

performance of the speech recognition devl fe a^eZ I^T . f ^T^"" °' '"^^«°"' ^" ^l^o^' as high a 
experimental and recoqnitk)n ohase ot ty^l ZlZ^^'Z: u ^^T"""' « "liferent microphone is used at the 

isusedatboththetrai;,ingandrecc^nitl^^s!l^^ ^"^"^^^^ 

The invention is characterised in what has been presented in the characterising part [...]

Figure 1 illustrates the structure of a speech recognition device according to prior art, as a block diagram,
Figure 2 illustrates the structure of an analysing block according to prior art,
Figures 3a and 3b illustrate the structure of a speech recognition device according to the invention,
Figure 4 illustrates the use of a normalisation buffer according to the invention,
Figure 5 illustrates the operation of a method according to the invention, as a flowchart, and
Figure 6 illustrates the structure of a mobile station according to the invention.

Figure 1 illustrates the block diagram structure of a known speech recognition device as applicable to the present invention. Typically, the operation of the speech recognition device is divided into two different main activities: an actual speech recognition phase 10-12, 14-15 and a speech training phase 13, as illustrated in Figure 1. The speech recognition device receives from a microphone, as its input, a speech signal s(n), which is transformed into a digital form by means of an A/D converter 10 using, e.g., a sampling frequency of 8 kHz and a 12 bit resolution per sample. Typically, the speech recognition device comprises a so-called front-end 11, wherein the speech signal is analysed and a feature vector 12 is modelled, the feature vector describing the speech signal during a specific period. The feature vector is defined, e.g., at 10 ms intervals. The feature vector can be modelled using several different techniques. For example, several different kinds of techniques for modelling a feature vector have been presented in the reference: J. Picone, "Signal modelling techniques in speech recognition", IEEE Proceedings, Vol. 81, No. 9, pp. 1215-1247, September 1993. The feature vector used in the present invention is modelled by defining so-called Mel-Frequency Cepstral Coefficients (MFCC). During the training phase, models are constructed by means of the feature vector, in a training block 13 of the speech recognition device, for the words used by the speech recognition device. In model training 13a, a model is determined for a recognisable word. At the training phase, repetition of the word to be modelled can be utilised. The models are stored in a memory 13b. During speech recognition, the feature vector is transmitted to an actual recognition device 14, which compares, in a block 15a, the models constructed during the training phase to the feature vectors to be constructed of the recognisable speech, and the decision on a recognition result is made in a block 15b. The recognition result 15 denotes the word, stored in the memory of the speech recognition device, that best corresponds to the word pronounced by a person using the speech recognition device.

Figure 2 illustrates the structure of a known analysing block of the front-end 11, applicable to the present invention. Typically, the front-end 11 comprises a pre-emphasising filter 20 for emphasising frequencies relevant to speech recognition. Typically, the pre-emphasis filter 20 is a high-pass filter, e.g., a 1st degree FIR filter having a response of H(z) = 1 - 0.95z^-1. Next, frames, N samples in length, are formed of the filtered signal in a block 21. By using, e.g., a sample length N=240, a frame structure of 30 ms is produced at the sampling frequency of 8 kHz. Typically, the speech frames can also be formed using a so-called overlap technique, wherein successive frames overlap to the extent of S successive samples (e.g., 10 ms). Before modelling a Fast Fourier Transform (FFT) frequency representation for the speech signal in a block 23, so-called windowing can also be carried out in order to improve the accuracy of a spectrum estimate using, e.g., a Hamming window in a block 22. Next, the FFT representation of the signal is transformed into a Mel frequency representation in a Mel windowing block 24. The transformation into the Mel frequency representation is known as such to a person skilled in the art. The transfer to Mel frequency representation has been presented in the source reference: J. Picone, "Signal Modelling Techniques in Speech Recognition", IEEE Proceedings, Vol. 81, No. 9. With this frequency transformation, the non-linear sensitivity of the ear to different frequencies is taken into consideration. Typically, the number (k) of the frequency bands used can be k=24. The actual feature vector 12, i.e., the so-called cepstral coefficients c(i), are obtained by carrying out a so-called discrete cosine transformation (DCT) for 26 logarithmic Mel values, formed in a block 25. For example, the number of degrees J=24 can be used in the discrete cosine transformation. Typically, only half of the DCT coefficients c(i), wherein i is the index of a cosine term, is used. Typically, the actual feature vector also contains information on speech dynamics by calculating so-called 1st and 2nd stage difference signals dc(i), ddc(i). These difference signals can be determined from the successive output vectors of a discrete cosine transformation block, in a block 27, by estimating that dc(i)=c(i)-c(i-1) and ddc(i)=dc(i)-dc(i-1). When these 26 additional parameters are taken into account, the length of the feature vector, in our exemplary case, is 13+26=39 parameters.
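
The first stages of the analysing block (pre-emphasis, framing and windowing) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the device's implementation: the function names are hypothetical, the filter is assumed to start from a zero initial state, and a frame advance of 80 samples (10 ms at 8 kHz) is assumed so that one frame is produced per 10 ms feature-vector interval.

```python
import math


def pre_emphasis(signal, alpha=0.95):
    """Apply the first-order FIR high-pass filter H(z) = 1 - alpha * z^-1."""
    filtered = []
    previous = 0.0  # assumption: the sample before the first one is zero
    for sample in signal:
        filtered.append(sample - alpha * previous)
        previous = sample
    return filtered


def split_into_frames(signal, frame_len=240, frame_shift=80):
    """Cut the signal into overlapping frames; a trailing partial frame is dropped."""
    last_start = len(signal) - frame_len
    return [signal[start:start + frame_len]
            for start in range(0, last_start + 1, frame_shift)]


def hamming_window(frame):
    """Weight a frame with a Hamming window before the FFT."""
    n = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1)))
            for k, x in enumerate(frame)]


# One second of a constant test signal sampled at 8 kHz.
signal = [1.0] * 8000
frames = [hamming_window(f) for f in split_into_frames(pre_emphasis(signal))]
print(len(frames), len(frames[0]))
```

The FFT, Mel filtering, logarithm and DCT stages would follow on each windowed frame; they are omitted here to keep the sketch short.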

Figures 3a and 3b illustrate the structure of the speech recognition device according to a first embodiment of the present invention. A front-end 30 produces, as an output signal, a feature vector x_i, i=1..M (e.g., M=39), at 10 ms intervals. The feature vector is stored in a normalisation buffer 31, by means of which a mean value μ_i and a standard deviation σ_i are calculated for each feature vector component x_i, i=1..M as follows:

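$$\mu_i = \frac{1}{N} \sum_{t=1}^{N} x_{t,i}, \quad i = 1, \ldots, M \qquad (1)$$

$$\sigma_i = \sqrt{\frac{1}{N} \sum_{t=1}^{N} \left( x_{t,i} - \mu_i \right)^2}, \quad i = 1, \ldots, M \qquad (2)$$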
In the formulas (1) and (2), N is the length of the normalisation buffer and M is the length of the feature vector. After this, the component of the feature vector to be recognised is normalised in a block 31 using the calculated normalisation coefficients μ_i, σ_i. The feature vector X to be normalised and recognised is located in the middle of the normalisation buffer 31 as illustrated in Figure 4.
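
As a rough illustration of the normalisation step, the following Python sketch computes per-component coefficients over a buffer of N feature vectors and applies them to one vector. It rests on assumptions that the text does not confirm: the standard deviation is taken in its population form (division by N), the normalisation itself is assumed to be the usual (x - μ)/σ, and the small `eps` guard against division by zero is an addition for the sketch. All function names are hypothetical.

```python
import math


def normalisation_coefficients(buffer):
    """Return per-component means and standard deviations over the buffer."""
    n = len(buffer)      # N: number of buffered feature vectors
    m = len(buffer[0])   # M: length of each feature vector
    means = [sum(vec[i] for vec in buffer) / n for i in range(m)]
    stds = [math.sqrt(sum((vec[i] - means[i]) ** 2 for vec in buffer) / n)
            for i in range(m)]
    return means, stds


def normalise(vector, means, stds, eps=1e-12):
    """Assumed form: subtract the mean and divide by the standard deviation."""
    return [(x - mu) / max(sd, eps) for x, mu, sd in zip(vector, means, stds)]


buffer = [[1.0, 10.0], [3.0, 10.0]]
means, stds = normalisation_coefficients(buffer)
print(means, stds)
print(normalise([3.0, 10.0], means, stds))
```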

The normalised feature vector X is transmitted, as an input signal, either to the speech recognition unit 14 or to the training block 13, depending on whether the device is in its training phase or in the actual speech recognition phase. In the method according to the first embodiment of the present invention, a normalisation buffer fixed in length (N) is preferably used, the buffer being slid over the feature vectors. Due to the sliding normalisation buffer, the method can also be implemented in a real-time speech recognition system. A normalisation buffer 34 is a buffer N*M samples in size, which can typically be implemented in connection with the speech recognition unit by programming a digital signal processor (DSP) using either the internal memory structures or the external memory of the DSP. In the solution according to the example of the present invention, the normalisation buffer is 100 feature vectors in length. The feature vector to be normalised and recognised at any one time is located in the middle of the normalisation buffer 34. Because the feature vector to be normalised is located in the middle of the normalisation buffer, a delay N, which is of the normalisation buffer's length, is caused in speech recognition. When using the parameters of the example, the delay is 100*10 ms = 1 s. However, this delay can be halved by using only part of the buffer's length at the beginning of speech recognition, as explained in the following.
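
The buffer size and delay figures quoted above follow from simple arithmetic, reproduced in the short Python sketch below. The function names and the `half_start` flag are illustrative assumptions, not terms from the text.

```python
def buffer_size_values(buffer_len, vector_len):
    """Number of stored values in an N x M normalisation buffer."""
    return buffer_len * vector_len


def normalisation_delay_ms(buffer_len, frame_interval_ms=10, half_start=False):
    """Delay before the first feature vector can be normalised.

    With half_start=True, only N/2 vectors are awaited before normalisation begins.
    """
    frames_awaited = buffer_len // 2 if half_start else buffer_len
    return frames_awaited * frame_interval_ms


print(buffer_size_values(100, 39))                    # 3900
print(normalisation_delay_ms(100))                    # 1000
print(normalisation_delay_ms(100, half_start=True))   # 500
```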

Figure 5 illustrates, in the form of a flowchart, the operation of the method according to the present invention. At the beginning of speech recognition, the normalisation buffer is filled until one half of the buffer's full length, N/2, has been used (blocks 100-102). After this, the mean value and standard deviation vectors μ_i, σ_i are calculated (block 103) and a first feature vector is normalised using the first N/2 feature vectors. The actual speech recognition process is carried out for this normalised feature vector X using Viterbi decoding in a block 15b (Figure 1) according to a known technique. Next, a new feature vector is buffered (block 104), new normalisation coefficients are calculated using the (N/2+1) stored feature vectors, and a second feature vector is normalised and recognition is carried out with it (block 103). The corresponding process is continued until the normalisation buffer is full. Then, a transfer is made, in the flowchart, from a block 105 to a block 106. This means that the first N/2 feature vectors have been recognised and the feature vectors to be normalised are located in the middle of the normalisation buffer. Now, the buffer is slid according to the FIFO principle (First In-First Out) so that after a new feature vector has been calculated and recognised (block 107), the oldest feature vector is removed from the normalisation buffer (block 106). At the end of the recognition phase (block 107), the normalisation coefficients are calculated using the values stored in the normalisation buffer. These same normalisation coefficients are used in connection with the recognition of the last N/2 feature vectors. Thus, the mean values and standard deviations are calculated using non-normalised feature vectors. When speech recognition has been carried out with all the N feature vectors (block 108), the speech recognition device models a result of the recognisable word (block 109).
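
The flow described above (start-up with half a buffer, growth to full length, FIFO sliding, and reuse of the final coefficients for the last vectors) might be sketched in Python as follows. This is a sketch under assumptions rather than the flowchart itself: the names are hypothetical, the population standard deviation and the (x - μ)/σ normalisation are assumed, and the exact position of the normalised vector relative to the buffer centre (here it lags the newest vector by N/2 - 1 frames) is an off-by-one choice made for the sketch.

```python
import math
from collections import deque


def _coefficients(buffer):
    """Per-component mean and population standard deviation of the buffer."""
    n, m = len(buffer), len(buffer[0])
    means = [sum(v[i] for v in buffer) / n for i in range(m)]
    stds = [math.sqrt(sum((v[i] - means[i]) ** 2 for v in buffer) / n)
            for i in range(m)]
    return means, stds


def _normalise(vector, means, stds, eps=1e-12):
    return [(x - mu) / max(sd, eps) for x, mu, sd in zip(vector, means, stds)]


def sliding_normalise(vectors, buffer_len=100):
    """Yield normalised feature vectors in input order."""
    half = buffer_len // 2
    buffer = deque(maxlen=buffer_len)  # FIFO: the oldest vector drops out when full
    pending = deque()                  # vectors received but not yet normalised
    for vec in vectors:
        buffer.append(vec)
        pending.append(vec)
        if len(buffer) >= half:
            # Coefficients always come from the raw (non-normalised) buffer contents.
            means, stds = _coefficients(buffer)
            yield _normalise(pending.popleft(), means, stds)
    if buffer:
        # End of input: the final coefficients serve all remaining vectors.
        means, stds = _coefficients(buffer)
        while pending:
            yield _normalise(pending.popleft(), means, stds)


out = list(sliding_normalise([[0.0], [2.0], [4.0], [6.0], [8.0], [10.0]],
                             buffer_len=4))
print(len(out), out[0], out[1])
```

Recomputing the coefficients from scratch for every frame keeps the sketch simple; a real-time implementation would more likely maintain running sums that are updated as vectors enter and leave the buffer.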

According to a second embodiment of the present invention, the length of the normalisation buffer may vary during speech recognition. At the beginning of speech recognition, it is possible to use a buffer shorter in length, e.g., N=45, and the length of the signal to be buffered can be increased as speech recognition progresses, e.g., for each frame (30 ms). Thus, as an exception to the first exemplary application of the invention, the feature vector to be normalised can be the first feature vector loaded into the buffer and not the middle feature vector of the buffer, and the buffer's entire contents at that particular moment can be utilised in the calculation of the normalisation coefficients. In this
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app.^r«n..he.eng..on.e.e.vlsN.N.eln,tHe,eng.o.ase«n,ent.^ 

are nom^lteed. but instead nom^alisattonjs carn^^ 
normalisation can only be carried outfor the most .mpo^ 

-^:e%T:jrr=^^^^ 

utilising the present invention. The mobile statK>n '^'^^J^^'^^'^^^^^^^ ,U mobile statton's operation. 
Keyboa'r d 62. a display 63, a speaker 64. as well ^^^^^''^W^S ometobile station. The control block 
In addition, the figure shows transmission and recept«n ^^^^ 67 68 ^^^^^ vVhen the 

65 also controls the operation o. ^^^^^^^^^^^ recognition device or during the 

speech recognition device is actuated erther ^"""9 ^'^^f 1^^^^^^ controlled by the control block, 

actual speech recognition process, audo can also be transmitted through 

a DSP and it comprises ROM/RAM memory circurts -^^^^^^^^^^^ method according to the present in- 
Table 1 illustrates the perlomiarKe of a speech ed with the use of non- 
vention, compared ^th other noisecor^ensat-^^.^^^^^^^ 

normalised Mel-frequency cepstral ^^^'^^^^^^Jjl^^^^^^^ environment. During speech recognftton 
carriedoutusingahidden Markov modelthathasbeenmod^^^^ 

a noise signal has been acWed to the word^o be recogri«e^^ ^ ^^^.^^ 3^^„3, 

•Clean. r,«de corresponds to a situation; wherein ^^^.J j,"^";^^^^^^^^ ,est results show that the speech 
speech recognition process have been earned out ma ""^^^thSStty of a recognition device particularly in a 

rUnilion device, accdrdihg to the present .r«^^^^^ devLi ' according to the present 

according to the invention. * ..f . r . "'^ - - ' 
Environment (SNR)    [...]      [...]      Norm. feature vectors
Clean                96.5%      96.6%      97.5%
5 dB                 95.0%      95.3%      96.1%
0 dB                 93.7%      94.9%      95.9%
-5 dB                89.3%      93.0%      95.3%
-10 dB               73.8%      [...]      [...]

This paper presents the implementation and [...]. However, [...] the invention has been presented above [...] the invention is also suitable for use in speech recognition [...] applied, for example, to speech recognition devices [...] that the present invention is not restricted to [...] above, and that the invention can also be implemented in another [...] of the invention. The embodiments presented should be considered illustrative, but not restricting. [...] invention are only restricted by [...] also belong to the scope of the invention.
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m<xriM paomel., and means II 4) to recoontaTnL^T?,' ' ^ ' ^ 

said n»ans 13. , te» a"SSS^» '"^ « W»~>« «l.".ct.,l,.d In M 
means (3,)lo,nM»,i„5 ins pa,an,.„,hSC'S^"',f'^^ 

<..sn.siso,,n.pa™.,s,si„sdp.*d.^:ts:;:r^x™ 

SlS^."- '.<:>^^^,^ ^ inal .aid «o.,^ n^ ,3,, comp*. , ,„,^, , 